Use of artificial intelligence for the prediction of lymph node metastases in early-stage colorectal cancer: systematic review

Abstract Background Risk evaluation of lymph node metastasis for early-stage (T1 and T2) colorectal cancers is critical for determining therapeutic strategies. Traditional methods of lymph node metastasis prediction have limited accuracy. This systematic review aimed to review the potential of artificial intelligence in predicting lymph node metastasis in early-stage colorectal cancers. Methods A comprehensive search was performed of papers that evaluated the potential of artificial intelligence in predicting lymph node metastasis in early-stage colorectal cancers. Studies were appraised using the Joanna Briggs Institute tools. The primary outcome was summarizing artificial intelligence models and their accuracy. Secondary outcomes included influential variables and strategies to address challenges. Results Of 3190 screened manuscripts, 11 were included, involving 8648 patients from 1996 to 2023. Due to diverse artificial intelligence models and varied metrics, no data synthesis was performed. Models included random forest algorithms, support vector machine, deep learning, artificial neural network, convolutional neural network and least absolute shrinkage and selection operator regression. Artificial intelligence models’ area under the curve values ranged from 0.74 to 0.9993 (slide level) and 0.9476 to 0.9956 (single-node level), outperforming traditional clinical guidelines. Conclusion Artificial intelligence models show promise in predicting lymph node metastasis in early-stage colorectal cancers, potentially refining clinical decisions and improving outcomes. PROSPERO registration number CRD42023409094.


Introduction
Colorectal cancer (CRC) is a leading cause of cancer-related death worldwide, and with the introduction of population-based screening programmes, there has been an increase in the diagnostic incidence of early (T1 and T2) CRC 1,2. The presence of lymph node metastases (LNM) serves as a crucial prognostic indicator in determining whether patients with early-stage CRC require additional surgical intervention following endoscopic resection, and whether adjuvant chemotherapy is indicated after surgical resection for patients with advanced-stage disease [3][4][5].
Current nodal status evaluation in patients with CRC relies on radiological imaging data such as magnetic resonance imaging or computed tomography (CT), and histopathological examination of endoscopic biopsies [6][7][8]. Several histologic risk factors have been proposed as predictors of regional LNM, including the presence of lymphovascular invasion, tumour budding, deep submucosal invasion and poorly differentiated cell clusters [9][10][11]. However, qualitative evaluation of pathological features, coupled with interobserver variability among pathologists 12, renders this approach inadequate for accurately predicting LNM in patients with CRC. To overcome these issues, a more reproducible and accurate prediction tool needs to be developed.
Artificial intelligence (AI) techniques including machine learning (ML) algorithms such as random forest (RF) decision trees and support vector machines (SVMs), as well as deep learning (DL) models like convolutional neural networks (CNN), have emerged as promising avenues to improve oncological diagnosis and prognosis [13][14][15][16][17][18]. Specifically, using computer vision (CV) models such as CNN to analyse whole slide images (WSI) enables an in-depth evaluation of tissue morphology through automated feature extraction 19. In comparison, ML algorithms assess human-derived factors such as clinical and histological variables to make their predictions. As AI-powered tools have the ability to process vast amounts of intricate data and identify hidden patterns, they have the potential to offer more accurate predictions of LNM than conventional methods. In recent years, the application of AI in cancer research has witnessed exponential growth, with research focusing on harnessing its potential to improve clinical management and patient outcomes [20][21][22]. The decision to proceed with colon resection after

Search strategy and study selection
All publications were identified through a systematic search of the following five databases: PubMed, MEDLINE, Embase, Cumulative Index to Nursing and Allied Health Literature (CINAHL) and Cochrane Central Register of Controlled Trials. Search strategies for this review combined search terms that included (artificial intelligence OR neural networks OR machine learning OR deep learning) AND (colorectal neoplasms OR colorectal cancer) AND (lymph nodes OR lymph node metastases) using a combination of subject headings and keywords in the title or abstract (Supplementary materials). Boolean operators AND/OR connected the search terms. The initial search strategy was developed for use in MEDLINE and was then adapted for the other databases.
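The way the three concept groups are combined can be illustrated with a short sketch. It is only a schematic of the Boolean structure described above: the function name `build_query` and the quoting of terms are assumptions for illustration, and the database-specific subject headings and field tags (given in the Supplementary materials) are not reproduced.

```python
# Concept groups taken from the search description above. Field tags and
# subject headings differ between databases and are deliberately omitted.
CONCEPTS = [
    ["artificial intelligence", "neural networks", "machine learning", "deep learning"],
    ["colorectal neoplasms", "colorectal cancer"],
    ["lymph nodes", "lymph node metastases"],
]


def build_query(concepts):
    """Join synonyms with OR inside a group and join the groups with AND."""
    groups = []
    for terms in concepts:
        quoted = ['"{}"'.format(term) for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = build_query(CONCEPTS)
print(query)
```

A record must match at least one term from every group, which is why widening a group with synonyms increases recall while adding a further group narrows the result set.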
Original studies that reported the use of AI for the prediction of LNM in early-stage CRC in patients older than years of age were included. All articles published up until March 2023 were included and publications that were abstract only or not written in the English language were excluded. Further studies were excluded if participants had more than one malignant polyp or other synchronous colorectal malignancies or had a previous history of CRC. The reference lists of the manuscripts identified using the search strategies were reviewed for studies that were not captured in the initial search. All search strategies were carried out on 2 August 2023. Search records were exported into Covidence (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia) for removal of duplications and subsequent screening.

Outcomes of interest
The primary outcome of the review was to summarize the different AI models utilized to predict LNM in early-stage CRC and describe the accuracy of these models. Secondary outcomes included identifying variables that influenced the performance of these models and exploring the potential strategies utilized to address or mitigate these challenges.

Data extraction
Two independent reviewers performed the first screening of publications, reviewing titles and abstracts. The records that were accepted in the initial screen were further reviewed using the manuscript full texts based on the eligibility criteria. Two reviewers critically appraised the studies, and the quality of the studies was evaluated using the Joanna Briggs Institute (JBI) critical appraisal tools. One reviewer collected the data from the studies. All disagreements were resolved by discussion and a consensus was reached.

Data analysis
As there was variability among the AI software utilized to assess LNM and some of the outcome measures were qualitative, a quantitative synthesis was not performed. The results were tabulated, and the outcomes were summarized. Each study was assessed according to its design, year, location, number of participants, AI program utilized, accuracy of the AI model and factors that influenced the reliability of the model. A critical appraisal of each manuscript was undertaken using the JBI tool, which assesses the quality and applicability of studies by systematically reviewing the bias of the methodology and subsequent analyses.

Search results
After the removal of duplications, our search strategy yielded a total of 3190 citations for abstract and title review, resulting in 39 full-text reviews, of which 11 were eligible and included in the final analysis (Fig. 1). All 11 studies were retrospective by design; three of these studies were single-centre [24][25][26] and the other eight were multicentre cohort studies [27][28][29][30][31][32][33][34]. The collection intervals of the nine studies ranged from 1996 to 2023, with a median range of 2001 to 2016. The total number of patients across all studies was 8648, with the largest study involving 4073 patients 30 and the smallest involving 146 patients 29. The largest study was undertaken in Japan, while the smallest was conducted in Taiwan. The studies were conducted in a variety of countries, including Japan, Taiwan, South Korea, Denmark and the USA; it is worth noting that some of the studies were conducted in non-English speaking countries.
Seven studies included patients who underwent surgical resection 24,27,28,30,32,33, while three studies included patients who underwent endoscopic resection 29,31,33 and one study included both 25. In terms of cancer stage, eight studies exclusively included patients with T1 cancer, one study exclusively included patients with T2 cancer, and two studies included both T1 and T2 cancers (Table 1).

Quality of the studies
The quality of all the studies was evaluated using the JBI scoring checklists specific to each study design. The studies were rated based on the number of fulfilled criteria, excluding criteria that were not relevant to the study. All of the studies included in the review satisfied a high proportion of criteria, ranging from 87.5 to 100%, which suggests a low risk of bias. The results of these assessments can be seen in the Supplementary materials.
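The rating rule described above (fulfilled criteria as a share of applicable criteria) can be written out as a short sketch. The answer labels and the treatment of 'unclear' as not fulfilled are assumptions made for illustration; the checklist responses themselves are hypothetical and not taken from any included study.

```python
def jbi_score(answers):
    """Percentage of applicable checklist items answered 'yes'.

    `answers` holds one entry per checklist item: 'yes', 'no', 'unclear'
    or 'na' (not applicable). Items marked 'na' are left out of the
    denominator; 'unclear' is counted as not fulfilled (an assumption).
    """
    applicable = [a for a in answers if a != "na"]
    if not applicable:
        raise ValueError("no applicable checklist items")
    fulfilled = sum(1 for a in applicable if a == "yes")
    return 100.0 * fulfilled / len(applicable)


# Hypothetical appraisal: nine items, one not applicable, one unclear.
example = ["yes", "yes", "yes", "na", "yes", "unclear", "yes", "yes", "yes"]
print(jbi_score(example))  # 7 of 8 applicable items fulfilled -> 87.5
```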

AI algorithms and models used
The 11 studies reviewed employed various AI models and approaches for predicting LNM. These fell into two distinct categories: those that used CV-based assessment of histopathological images (Table 2) and those that employed AI analysis of clinicopathological features (Table 3). Techniques such as attention-based DL and CNNs, alongside feature extraction from WSIs, enable integration of clinical variables and image-based features, streamlining the diagnostic process. In the studies that utilized AI for direct analysis of histopathological images, CNN were predominantly employed for metastasis detection or for classifying cancerous regions by histologic type [31][32][33][34]. Preprocessing steps in these studies included data augmentation 28,30 and techniques like segmentation of cell nuclei and pixel thresholding to analyse large histopathological images. Attention-based DL models are used to rank the relative importance of tissue regions in a haematoxylin and eosin WSI and aggregate information for making a prediction 35. Preprocessing steps include normalization of features to increase classification accuracy.
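The ranking and aggregation performed by attention-based models can be illustrated with a minimal sketch of attention pooling. This is not the architecture of any included study: the tile features and raw attention scores are hypothetical values that a trained network would normally produce, and the function names are introduced here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tile_features, tile_scores):
    """Weight each tile's feature vector by its attention and sum them."""
    weights = softmax(tile_scores)
    dim = len(tile_features[0])
    pooled = [0.0] * dim
    for weight, features in zip(weights, tile_features):
        for i in range(dim):
            pooled[i] += weight * features[i]
    return weights, pooled


# Hypothetical inputs: three tiles with 2-dimensional features and raw
# attention scores (a trained network would supply both).
features = [[0.2, 0.1], [0.9, 0.8], [0.3, 0.2]]
scores = [0.1, 2.0, 0.3]
weights, slide_vector = attention_pool(features, scores)

# Tiles ordered from most to least influential for the slide-level vector.
ranking = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
print(ranking, slide_vector)
```

The attention weights sum to one, so the slide-level vector is a weighted average of the tile vectors, and the same weights give a ranking of tissue regions by their contribution to the prediction.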
Table 6 (continued)
(Authors not shown)
Factors: The quality of the histopathological data such as poor image quality, bad haematoxylin and eosin staining, duplicated images and artefacts were recognized to impact the data set. Additionally, the choice of algorithm and input features were also noted to influence AI performance.
Strategies: Exclusion of inadequate image data: slides with poor image quality, bad haematoxylin and eosin staining, duplicated images and artefacts were excluded. Use of U-Net architecture: the U-Net architecture was chosen for its ability to improve the performance of fine segmentation and localization, particularly for biomedical images.
Ichimasa et al., 2022
Factors: The study found that lymphatic invasion was the most influential factor for LNM among the eight examined factors.
Strategies: The study did not explicitly address factors that could impact the performance of the AI models, such as the size and quality of the training data set or the choice of algorithm. Future studies could address these factors by using larger and more diverse data sets, exploring different algorithms and conducting sensitivity analyses to assess the robustness of the models.
Kang et al., 2020
Factors: Factors that were found to impact the performance of the AI models included the choice of input features (TIL subtypes and clinicopathologic parameters) and the use of the LASSO algorithm for feature selection. The size and quality of the training data set may also have an impact on model performance.
Strategies: In an attempt to address these factors Kang et al. (2020) selected relevant clinicopathologic parameters and TIL subtypes for inclusion in the LASSO model and utilized cross-validation to optimize the hyperparameters of their algorithm. As this study used a relatively small sample size, future studies would benefit from larger data sets to further validate the performance of their AI model. Additionally, in this study, co-morbidities were not considered as potential confounding factors which could impact model performance and should look to be addressed in future studies.
Kasahara et al., 2022
Factors: This study found that one of their biggest limitations was the small number of centres included; different staining methods and specimen characteristics could significantly affect the results.
Strategies: Future studies based on prospective multicentre sites are required to validate and expand on these findings.
Kudo et al., 2020
Factors: The authors noted the data set was heavily imbalanced with the dependent variables. Other biases acknowledged were the use of surgically resected tissue versus endoscopic and the type of clinicopathological data sets being used.
Strategies: As the data was heavily imbalanced this study utilized a weighting regularizer to address the dependent variable imbalance and by performing a hyper-parameter evaluation to obtain the optimum parameter set. They also utilized various types of clinicopathological data using larger numbers of samples. Future studies could address these factors by using larger and higher quality data sets, comparing different algorithms and input features, and performing more rigorous validation methods.
(continued)

In the study conducted by Brockmoeller et al. 24 the researchers employed deep neural network models specifically trained on haematoxylin and eosin-stained WSI to predict LNM in patients with CRC. They utilized the ShuffleNet network model, which is known for its efficiency and accuracy in image classification tasks. They also used transfer learning, a technique that leverages pretrained models on a new but related task. This approach was used to reduce the time required to train the model and the data requirements, which is especially beneficial when dealing with medical images where data can be scarce. The study aimed to predict various parameters directly from the WSI, such as the presence of LNM status, the number of examined LN being positive, and the DNA mismatch repair status (dMMR). They also distinguished between different subgroups of CRC, namely pedunculated and sessile pT1 CRC. To ensure the quality and consistency of the input data, the images underwent a rigorous preprocessing regimen. This involved extracting tiles of 512 × 512 pixels in size from the WSIs and removing tiles containing background or artefacts based on Canny edge detection. Furthermore, to address the common issue of colour variations in histopathological images, they employed the Macenko colour normalization method, ensuring consistent colour representation across all images.
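The tiling and edge-based filtering steps described for Brockmoeller et al. can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: a crude intensity-gradient test stands in for Canny edge detection, the thresholds are arbitrary, the function names are hypothetical, and small 8 × 8 grayscale patches replace 512 × 512 tiles to keep the example short. Colour normalization is not shown.

```python
def tile_origins(width, height, tile=512):
    """Top-left corners of non-overlapping tiles that fit fully inside the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def edge_fraction(gray, threshold=30):
    """Share of pixels whose horizontal or vertical intensity jump exceeds `threshold`.

    A simple gradient test used here as a stand-in for Canny edge detection.
    """
    rows, cols = len(gray), len(gray[0])
    edges = 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = abs(gray[r][c + 1] - gray[r][c])
            dy = abs(gray[r + 1][c] - gray[r][c])
            if max(dx, dy) > threshold:
                edges += 1
    return edges / ((rows - 1) * (cols - 1))


def keep_tile(gray, min_edge_fraction=0.02):
    """Keep tiles with enough edges; blank background has almost none."""
    return edge_fraction(gray) >= min_edge_fraction


# Hypothetical patches: uniform bright background versus textured tissue.
background = [[240] * 8 for _ in range(8)]
tissue = [[(40 if (r + c) % 2 else 200) for c in range(8)] for r in range(8)]

print(len(tile_origins(2048, 1024)))  # 4 columns x 2 rows -> 8
print(keep_tile(background), keep_tile(tissue))  # False True
```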
Other studies utilized ML techniques that incorporated both clinical and histopathological features. Random forest classifiers (RFC) and DL models were commonly used in this category 36. Takamatsu et al. 32 used a supervised ML approach to predict LNM based on 16 parameters in a data set of 397 patients with CRC. The data set was randomly split into a training set (70%, n = 277) and a test set (30%, n = 120) with similar LNM rates. The RFC ML algorithm was selected to minimize overfitting 32. The RFC then creates a set of decision trees from randomly selected parameters of the training data set; the computer-learned patterns were then used to predict cases as positive or negative for LNM.
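One way to obtain a random split with similar LNM rates in both parts is to split each outcome class separately. The sketch below shows this stratified approach on a hypothetical cohort; whether Takamatsu et al. used stratification or simply checked the rates after a plain random split is not stated, so the procedure and the function name `stratified_split` are assumptions.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with a similar positive rate in each part."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for value in sorted(set(labels)):
        # Shuffle and cut each class on its own so both parts keep its share.
        members = [i for i, label in enumerate(labels) if label == value]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical cohort: 100 cases, 10 of them LNM-positive (label 1).
labels = [1] * 10 + [0] * 90
train, test = stratified_split(labels)
train_rate = sum(labels[i] for i in train) / len(train)
test_rate = sum(labels[i] for i in test) / len(test)
print(len(train), len(test), train_rate, test_rate)
```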
Kudo et al. developed a predictive model using eight factors: patient age, sex, tumour size, tumour location, morphology, lymphatic invasion, vascular invasion and histological grade 34. An artificial neural network (ANN) then utilized this model to predict the likelihood and risk of LNM. ANN models comprise interconnected neurons that share information; each connection carries a weight reflecting its significance, and these weights are updated through a training process 37,38. This allows the ANN to continually develop relationships between input and output variables. The network was trained by iteratively changing the class weight and a hyperparameter evaluation was performed to obtain the optimum parameter set.
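Two ingredients of such a model, rescaling continuous inputs to the range 0 to 1 and weighting the rarer LNM-positive class more heavily in the loss, can be sketched as follows. The values, the weight of 3.0 and the function names are hypothetical; the sketch does not reproduce the network, loss function or parameter settings used by Kudo et al.

```python
import math


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def weighted_log_loss(y_true, y_prob, positive_weight):
    """Mean binary cross-entropy with positive cases scaled by `positive_weight`."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        if y == 1:
            total += -positive_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(y_true)


# Hypothetical inputs: ages in years, and predicted LNM probabilities for
# four cases of which only the first is LNM-positive.
ages = [40, 55, 70, 85]
print(min_max(ages))

y_true = [1, 0, 0, 0]
y_prob = [0.3, 0.2, 0.1, 0.2]
print(weighted_log_loss(y_true, y_prob, positive_weight=1.0))
print(weighted_log_loss(y_true, y_prob, positive_weight=3.0))
```

With a larger positive weight, a missed LNM-positive case contributes more to the loss, so training is pushed towards higher sensitivity on an imbalanced data set; the weight itself is one of the hyperparameters that would be tuned.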

Model performance metrics
Tables 4 and 5 summarize the performance metrics that were utilized by the studies to evaluate the accuracy of the AI models and the comparative analyses employed to assess their models' performance against traditional methods. Performance metrics such as area under the curve (AUC), sensitivity, specificity and accuracy were employed across studies to evaluate AI models. Comparatively, AI models were assessed against traditional methods, with certain studies noting AI's capability of enhancing predictive precision while potentially reducing the need for unnecessary interventions, as indicated by specific comparisons and statistical analyses. Kasahara et al. also demonstrated high levels of accuracy utilizing a regions of interest (ROI)-based discrimination of LNM risk, with their SVM and RF models achieving a case-by-case accuracy of 81.8% and 86.8% respectively. When validating their prediction models Kudo et al. and Song et al. observed improved specificity of their models compared with current US and Japanese Society for Cancer of the Colon and Rectum (JSCCR) guidelines, with the AI model having safely reduced 15.1% of unnecessary additional surgeries that would be indicated when using current management guidelines.
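For reference, the metrics named above can be computed from binary outcomes and model scores as in the following sketch. The labels, scores and the 0.35 decision threshold are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from any included study.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and NPV from binary labels (1 = LNM)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true, scores):
    """Probability that a positive case scores above a negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return ratio(wins, len(pos) * len(neg))


# Hypothetical cohort of six patients, two with LNM.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.35 else 0 for s in scores]

print(confusion_metrics(y_true, y_pred))
print(auc(y_true, scores))  # 7 of 8 positive-negative pairs ranked correctly -> 0.875
```

In this example sensitivity and NPV are both 100% while specificity is 75%: no patient with LNM is missed, and three of the four patients without LNM would be spared additional surgery. This is the trade-off the guideline comparisons are concerned with.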
Takamatsu et al. reported no significant difference in their cross-validation study, demonstrating AUCs of 0.938 and 0.826 for ML and conventional methods respectively. In a more recent study, Takamatsu et

Factors affecting AI model performance
Several factors were found to influence the performance of the AI models across the nine studies (Table 6). When reviewing input features Kang et al. demonstrated that tumour-infiltrating lymphocyte (TIL) subtypes, clinicopathologic parameters and the use of the LASSO algorithm for feature selection all influenced the accuracy of the prediction model 28. In this particular study, this factor was addressed through cross-validation to optimize the hyperparameters of the LASSO algorithm. Overall, the size and quality of the training data set

Discussion
With the rising diagnostic incidence of early-stage (T1 and T2) CRC, accurate prediction of LNM has become increasingly critical in optimizing therapeutic strategies 3,5. This review highlights the potential of different AI models in predicting the presence of LNM in early-stage CRC. AI prediction models may offer improved accuracy leading to earlier detection, enhanced treatment planning and overall improved outcomes for patients. However, variability among the models used, algorithm selection and input features strongly influence the performance and generalizability of the AI models. In this review, AI models employed to predict LNM in early-stage CRC were broadly categorized into two main groups: those focusing on image analysis using CV algorithms and those centred on AI analysis of clinicopathological features. For the former group, researchers utilized DL models to analyse WSIs and histopathological data 24,26,31,34. Brockmoeller et al. highlighted the ability of AI to predict LNM status from tumour slides without the addition of manual annotations. Their study underscored the significance of high-quality training data and selection of algorithms utilized 24. Furthermore, Kwak et al. emphasized how poor image quality, bad haematoxylin and eosin staining, duplicated images and artefacts within the data set can influence image analysis and the subsequent choice of DL architecture selected. In the latter group, the studies predominantly employed ML techniques, such as RF algorithms, ANN and LASSO regression, focusing on the incorporation of clinicopathological variables 25,[30][31][32][33].
Ichimasa et al. incorporated 45 clinicopathological factors in their SVM model, demonstrating the complexity and diversity of input features utilized to achieve robust predictions 25. An advantage of this approach is that it allows clinicians to better comprehend the analysis results and enables them to provide patients with an explanation as to why the algorithm arrived at these conclusions. Having a sufficient understanding of the algorithms utilized to predict LNM is critical for the successful integration of ML in clinical medicine, benefiting both patients and clinicians in guiding well-informed treatment decision-making.
Regardless of the approach utilized, interobserver disagreements on the prediction of LNM based on histopathological risk factors have remained a challenge 39,40; as such the diagnostic reproducibility of AI models is critical to generate a stable and reliable prediction model. Similar results were also demonstrated in the other articles included in this review [27][28][29][30][31]. The predictive ability of these models was ascertained through performance metrics such as the area under the receiver operating characteristic curve (AUC). Overall, the AI models were shown to demonstrate a high level of accuracy and potential for the reduction of subsequent unnecessary resections when compared with conventional guidelines [27][28][29][30]. However, the heterogeneity observed across the different study designs, patient population numbers, AI models and performance metrics utilized in the individual studies is likely to impact the generalizability of these findings.
AI models have demonstrated promising potential in improving the accuracy of LNM prediction in early-stage CRC, often outperforming traditional guidelines 27,31,41. These differences would likely result in earlier detection, enhanced treatment planning and overall improved outcomes for patients. Specifically, the AI model utilized by Ichimasa et al. demonstrated a significantly higher specificity when compared with the NCCN, ESMO and JSCCR guidelines, while maintaining a sensitivity and negative predictive value (NPV) of 100% 25. These results translated to a potential reduction in unnecessary operations of 8-14% 25. Another study that utilized an ANN model on 3134 patients with T1 CRC achieved an AUC of 0.83 in its validation cohort, which outperformed the clinical guidelines' AUC of 0.73 30. These models highlight the potential of AI to reduce over-surgery rates and improve patient outcomes through enhanced metastasis detection in clinical settings. Despite these advantages, the limitations of these models include the need for large high-quality data sets to train them, ambiguity surrounding interpretability, and the possibility that biases are introduced during the initial data collection and model development stages [27][28][29]. Additionally, extensive computational resources and expertise may be required to successfully implement and maintain these programs.
Numerous factors were identified to influence the performance of AI models in predicting LNM, including histological subtypes, size of the lesion, algorithm selection, input features chosen, as well as the size and quality of the data set utilized to train the model. The choice of TIL subtypes 28, incorporation of clinical parameters 30,31,33 and utilization of the LASSO algorithm to guide feature selection also influenced the models' overall performance. Of note, smaller sample sizes in some of the studies and recognition of particular confounding factors such as patient co-morbidities, variation in staining methods and specimen characteristics may also have an influence on the AI models' accuracy and robustness 29. To address these challenges, it is essential to incorporate a more diverse set of training samples, particularly to include both isolated tumour cells and underrepresented histologic subtypes. Enhancing model performance and ensuring the generalizability of these approaches necessitate further refinement of the feature selection process to allow optimization of the hyperparameters of a given algorithm. Increasing the size of quality training data sets and recognition of potential confounding factors will assist in improving the performance of these models, ultimately enhancing the reliability and accuracy of the prediction of LNM.
Future directions of the use of AI in the prediction of LNM in early-stage CRC should focus on addressing the limited sample sizes available, lack of diversity amongst the training data sets and inadequate evaluation and application of these AI models within the clinical setting. To reduce training time and enhance the generalizability of the models, a key area of exploration should be the use of transfer learning, where pretrained models are fine-tuned on CRC-specific data sets 42. Additionally, new methodological innovations are being developed that will assist in addressing limited sample sizes and the performance of AI models, for example generative adversarial networks (GANs) and vision transformers (ViTs). GANs are trained on the real original images of the sample data set, which the GANs use to synthetically produce new images that are similar in appearance to the original input images. These GANs can also be conditioned with certain attributes or labels from the original sample data set to produce synthetic images that are otherwise identical to the original sample including molecular phenotypic information 43. ViTs are a type of neural network based on the transformer architecture, which was originally utilized in natural language processing tasks. ViTs are more flexible than CNNs in their ability to learn information through self-attention mechanisms. ViTs are not only able to learn features from nearby image pixels (in the form of patches), but also learn features from distant image pixels/patches within the image that contribute to the overall image classification label. ViTs will be useful to learn and understand the significance of distal patterns within larger image regions 43,44.
Furthermore, the incorporation of ensemble methods, which combine the strengths of multiple different AI models, is likely to enhance predictive accuracy and robustness 45,46. Incorporating a wider range of patient variables may also enhance the models' predictive power and relevance. Specifically, the inclusion of radiomic features, genomic and proteomic data such as gene expression profiles and cellular biomarkers may provide assistance in recognizing the molecular signatures that are associated with the presence or likelihood of LNM 47. Evaluating these models in an array of settings and diverse patient populations is critical in ascertaining the real-world applicability of these models. Design and delivery of larger scale, multicentre studies applying these models prospectively on diverse cohorts will assist in the validation process. To facilitate this, the use of federated learning and swarm learning approaches presents opportunities to provide high-quality data for training without the necessity to exchange personal identifiable information about patients, which avoids practical and legal issues surrounding data sharing and data sovereignty for institutions. Federated learning involves several AI models trained on computers independently from each other using separate data sets. During training, model updates are fed back to a centralized server which updates the model's parameters (weights and biases) based on all the data sets and passes the updated, learned model parameters in a coordinated manner back to participants to continue with their AI model training and validation processes 43,48. Swarm learning differs from federated learning in that several parties can co-train an AI model together through a coordinated blockchain-based communication. This approach allows exchanging AI model updates directly amongst peers without the need for passing data through a controlled, centralized server. Swarm learning offers several advantages. First, the AI model is not affected by the loss of a single party training the AI model. Second, control of the AI model development does not lie solely with a single party but is distributed amongst all parties. Third, data security and privacy are preserved at the local site. Finally, swarm learning promotes equality by facilitating data sharing and collaboration amongst research parties 43,49.
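The server-side step of federated learning described above can be illustrated with a minimal sketch. It assumes the common choice of averaging the sites' parameters weighted by their number of cases; the aggregation rule, the function name `federated_average` and all numbers are assumptions for illustration and are not taken from the cited work.

```python
def federated_average(client_weights, client_sizes):
    """Average parameter vectors, weighting each site by its number of cases."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i in range(dim):
            averaged[i] += weights[i] * size / total
    return averaged


# Hypothetical round: three hospitals send locally trained parameters to
# the server; the underlying patient records never leave each site.
site_params = [[0.10, -0.20], [0.30, 0.00], [0.50, 0.40]]
site_cases = [100, 300, 600]

global_params = federated_average(site_params, site_cases)
print(global_params)  # approximately [0.40, 0.22]
```

The averaged parameters would then be returned to every site for the next round of local training. In swarm learning the same kind of merge is carried out among peers rather than by a central server.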
This review has several limitations that should be acknowledged. As all of the studies included in this review were retrospective, there is potential for selection bias due to reliance on pre-existing data from study populations that may not be representative of the broader population. There is also a lack of control over the quality of the data obtained and over confounding variables, as well as ambiguity in establishing the correct temporal relationship between events. There was significant heterogeneity in the AI models used to predict LNM and a wide range of years over which the data were collected; collectively, conclusions drawn from these results need to be interpreted with caution. The restriction to articles written in the English language could also introduce bias through the omission of possibly relevant articles written in another language.
Overall, the application of AI in predicting LNM in early-stage CRC has shown significant promise. Risk evaluation of LNM in CRC has critical implications in guiding therapeutic management and reducing overtreatment of early-stage disease. Further research with larger, multicentre prospective studies inclusive of diverse populations is essential to further ascertain and validate the predictive potential of AI in this setting.

Fig. 1 PRISMA diagram. Reports assessed for eligibility: n = 39. CINAHL, Cumulative Index to Nursing and Allied Health Literature.

Table 2 Artificial intelligence algorithms and models: image analysis using a CV algorithm
Columns: Authors; AI models used; Features and variables used by these models (for example clinical, histopathological and radiological data); Preprocessing steps (for example data normalization, feature selection or augmentation)
CV, computer vision; AI, artificial intelligence; DL, deep learning; DCNN, deep convolution neural network; WSI, whole slide imaging; AM, attention module; CRC, colorectal cancer; SM, submucosal; LVI, lymphovascular invasion; FE, finite element; FV, feature vector; CNN, convolutional neural network; RFC, random forest classifiers; LNM, lymph node metastases; RF, random forest; LN, lymph node; dMMR, mismatch repair deficient; SVM, support vector machines.

Table 3 Artificial intelligence algorithms and models: AI analysis of clinicopathological features
Columns: Authors; AI models used; Features and variables used by these models (for example clinical, histopathological and radiological data); Preprocessing steps (for example data normalization, feature selection or augmentation)
(Authors not shown)
Features and variables: The following features and variables were used by the ANN model: patients' age, sex, tumour size, location, morphology, lymphatic invasion, vascular invasion and histologic grade.
Preprocessing steps: Data normalization was performed on age and tumour size, which were normalized to be in a range (0,1) for classification accuracy.
Takamatsu et al., 2019
AI models used: The study employed a RFC machine learning algorithm.
Features and variables: The largest slice of each tumour was selected for evaluation of the following histologic factors: tumour depth, lymphatic invasion, venous invasion, poorly differentiated clusters and tumour budding.
Preprocessing steps: Preprocessing steps included image resolution equalization, deletion of non-cancerous areas, binary image conversion and morphological analyses.
AI, artificial intelligence; RF, random forest; CEA, carcinoembryonic antigen; LASSO, least absolute shrinkage and selection operator; IHC, immunohistochemical; MMR, markers for mismatch repair; TILs, tumour-infiltrating lymphocytes; ANN, artificial neural network; RFC, random forest classifiers; SVM, support vector machines.

Table 4 Model performance and metrics: image analysis using a CV algorithm
Columns: Authors; Performance metrics used in the included studies to evaluate the AI models, such as sensitivity, specificity, accuracy, AUC and F1 score; Comparison of the AI models' performance to traditional methods, if applicable (for example clinical nomograms, TNM staging system)
The prediction of the presence of more than one LNM in all pT1 CRCs had a cross-validated AUROC of 0.733 (0.67-0.758) and patients with any LNM had an AUROC of 0.711 (0.597-0.797). The prediction of one or any LNM status in patients with pT1 had an AUROC of 0.733 (0.644-0.778) and 0.567 (0.542-0.597) respectively. A multivariate linear regression was performed to compare DL prediction of LNM with established histopathological risk factors. In the pT1 cohort, the AI score (normalized between 0 and 1) and 'venous invasion' (with values 0 and 1) achieved high coefficients for the presence of LNM (b = 0.35, P = 0.07 and b = 0.28, P < 0.
CV, computer vision; AI, artificial intelligence; AUC, area under the curve; TNM, tumour node metastases; ROC, receiver operating characteristic; LNM, lymph node metastases; CRC, colorectal cancer; RFC, random forest classifiers; JSCCR, Japanese Society for Cancer of the Colon and Rectum; RF, random forest; AUROC, area under the receiver operating characteristics; DL, deep learning; TIL, tumour-infiltrating lymphocyte; LASSO, least absolute shrinkage and selection operator; ROI, regions of interest; SVM, support vector machines.
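Because Table 4 reports model performance mainly as AUROC, a short illustration of what that number measures may help. The sketch below computes AUROC as the probability that a randomly chosen LNM-positive case receives a higher model score than a randomly chosen LNM-negative case, with ties counted as one half. The function name, labels and scores are hypothetical and do not reproduce any included study's data or code.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative cases.

    labels: 1 for LNM-positive, 0 for LNM-negative.
    scores: model outputs, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical example: two LNM-positive and three LNM-negative cases.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
value = auroc(labels, scores)
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the values reported in the table should be read.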

Table 5 Model performance and metrics: AI analysis of clinicopathological features
Columns: Authors; Performance metrics used in the included studies to evaluate the AI models, such as sensitivity, specificity, accuracy, AUC and F1 score; Comparison of the AI models' performance to traditional methods, if applicable (for example clinical nomograms, TNM staging system)
AI, artificial intelligence; AUC, area under the curve; TNM, tumour node metastases; ANN, artificial neural network; LNM, lymph node metastases; CRC, colorectal cancer; ROC, receiver operating characteristic; SVM, support vector machine; DSC, dice similarity coefficient; PTS, predictive value of the peri-tumoral stroma.

Table 6 Factors affecting artificial intelligence model performance
Columns: Authors; Factors that were found to impact the performance of the AI models, such as the size and quality of the training data set, the choice of algorithm or the use of different input features; How these factors were addressed or could be addressed in future studies
Deep neural network models were trained on all of the available hematoxylin and eosin WSIs without restriction of tissue types or the areas used. This allows an unbiased learning system with regard to the tissue region in which to detect the predictive factors. To reduce the bias caused by artefacts, tiles containing these were removed based on Canny edge detection in Python's OpenCV package.
Kwak et al., 2020
Thompson et al. | 7

and the inclusion of clinicopathological features were consistently found to potentially impact AI model performance. Brockmoeller et al. highlighted the significance of high-quality training data and, in particular, the relevance of the training images to the prediction task. Their study also underscored the importance of using WSI without manual annotations, and how the subsequent choice of DL algorithm may impact model performance. By directly predicting LNM status from primary tumour slides without manual annotations, they showcased the potential of AI in histopathological analysis. Their findings suggest that future studies could examine the impact of manual annotations on model performance, explore the benefits of larger data sets and compare different DL architectures to optimize results.